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Abstract 

Background: Tandem mass spectrometry-based database searching is currently the main method for protein 
identification in shotgun proteomics. The explosive growth of protein and peptide databases, which is a result 
of genome translations, enzymatic digestions, and post-translational modifications (PTMs), is making computational 
efficiency in database searching a serious challenge. Profile analysis shows that most search engines spend 
50%-90% of their total time on the scoring module, and that the spectrum dot product (SDP) based scoring module 
is the most widely used. As a general purpose and high performance parallel hardware, graphics processing units 
(GPUs) are promising platforms for speeding up database searches in the protein identification process. 

Results: We designed and implemented a parallel SDP-based scoring module on GPUs that exploits the efficient use 
of GPU registers, constant memory and shared memory. Compared with the CPU-based version, we achieved a 30 to 
60 times speedup using a single GPU. We also implemented our algorithm on a GPU cluster and achieved an approximately 
favorable speedup. 

Conclusions: Our GPU-based SDP algorithm can significantly improve the speed of the scoring module in mass 
spectrometry-based protein identification. The algorithm can be easily implemented in many database search engines such 
as XITandem, SEQUEST, and pFind. A software tool implementing this algorithm is available at http//www.comp.hkbu.edu. 
hk/~youli/ProteinByGPU.html 



Background 

High-throughput tandem mass spectrometry (referred to 
hereafter as MS/MS) based protein identification is a 
powerful method in proteomics [1]. It enables large-scale 
analysis of the protein sequence and PTMs with high sen- 
sitivity, accuracy, and throughput [2-6]. Among the MS/ 
MS data analysis methods, protein database search ap- 
proaches, such as Mascot [7], SEQUEST [8], pFind [9-11], 
XITandem [12], OMSSA [13], and Phenyx [14], have been 
the most widely used. Although much research has been 
devoted to improving the methods effectiveness by de- 
signing new scoring and validating algorithms, creating ef- 
ficient database search engines is a serious challenge for a 
number of reasons. 
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First, the number of entries in a protein sequence data- 
base increases continuously. Take, for example, IPI.Human, 
in which the number of proteins increased by almost a 
third between v3.22 and v3.49 [15]. Second, if semi- or 
non-specific digestion is considered, as it increasingly is, it 
will lead to 10 or 100 times more digested peptides, re- 
spectively, in the database, than if only specific digestion is 
considered. Third, post-translational modifications (PTMs) 
generate exponentially more modified peptides. Until re- 
cently, over 900 types of PTMs existed in the Unimod pro- 
tein modification for mass spectrometry [1]. If we choose 
ten common variable PTMs and limit the number of 
modification sites in a peptide to no more than five, 
the number of tryptic peptides of the human proteome 
will be increased over 1000 times. At the same time, 
the generation speed of the mass spectrometers has 
been steadily increasing. 

As all of the algorithms in a database search engine cal- 
culate the similarity between the experimental MS/MS 
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and the theoretical candidate MS/MS generated from the 
protein (peptide) database, one of the direct results of the 
above increases is the large scale of the scoring calcula- 
tion, which is the most computing intensive and time con- 
suming stage in the whole flow of protein identification. 
Profiling analysis shows that the scoring module takes 
around 50-90% of the total identification time in both 
pFind and X!Tandem. Thus, speeding up the scoring mod- 
ule is a promising method to increase the efficiency of 
protein identification. 

Recent studies on efficiency focus on decreasing the 
redundant scoring operations and parallelizing the scor- 
ing module. Some studies adopted indexing techniques 
to avoid unnecessary scoring. Li and Chi systematically 
explored the effect of indexing techniques and designed 
an inverted index strategy for protein identification [15]. 
Edwards and Lippert considered the problem of redun- 
dant peptides and peptide-spectrum matching and used 
a suffix tree index [16]. Tang adopted peptide and b/y ions 
indices [17]. Dutta and Chen used the nearest neighbor 
search to improve peptide-spectrum matching [18]. At the 
same time, most of the popular peptide and protein search 
engines use parallel computing technology. SEQUEST 
adopted a parallel virtual machine (PVM) to build its clus- 
ter system [19], whereas Mascot and Phenyx used a mes- 
sage passing interface (MPI). XITandem has two parallel 
strategies [20,21]. These systems have been integrated into 
higher-level application frameworks, such as web ser- 
vices, grids [22], or even cloud computing. Halligan 
has migrated XITandem and OMSSA to the Amazon 
cloud computing platform [23]. In addition, some re- 
cent systems used new hardware to increase the paral- 
lelizing of the scoring module. Bogdan and Dandass 
use a field-programmable gate array (FPGA) [24]. Hussong 
used a single graphics processing unit (GPU) to speed up 
the feature selection step [25]. Baumgardner implemented 
a spectrum library search algorithm on a single GPU [26]. 
Milloy also adopted a single GPU to speedup database 
spectral matching [27]. 

Recently, GPUs have become general purpose and high 
performance parallel hardware and provided another prom- 
ising platform for parallelizing the scoring function. GPUs 
are dedicated hardware for manipulating computer graphics. 
Due to the large demand for real-time computing and high- 
definition 3D graphics, GPUs have evolved into highly paral- 
lel many-core processors [28]. NVIDIA GTX580 is an 
example of a typical GPU architecture. GTX580 has 16 
streaming multiprocessors (SMs), and each SM has 32 
scalar processors (SPs), resulting in a total of 512 proces- 
sor cores. The SMs have a single-instruction multiple- 
thread (SIMT) mode; at any given clock cycle, each SP 
executes the same instruction, but operates on differ- 
ent data. The recent advances in computing power in 
GPUs have driven the development of general-purpose 



computing on GPUs (GPGPU), which have been used 
to accelerate a wide range of applications [29-33]. 

Considering the independence of each scoring oper- 
ation in a protein identification database search engine, 
it is reasonable to parallelize the scoring module in an 
SIMT architecture on a GPU or GPU cluster. To the best 
of our knowledge, three studies [25-27], have attempted 
to use GPUs to speed up peptide/protein identification. 
Hussong et al. [25] focused on peak selection, while [26] 
and [27] are dedicated on spectral library search. Mean- 
while, few studies have discussed peptide/protein identifica- 
tion on a GPU cluster. For this study, we choose one of the 
most widely used scoring methods, spectral dot product 
(SDP), which can be used directly or indirectly in XITandem, 
pFind, SEQUEST, etc. We conduct systematic research to 
design a parallel SDP-based scoring module for both a single 
GPU and a GPU cluster, using a general purpose parallel 
programming model, specifically, the Compute Unified 
Device Architecture (CUD A). 

Our first contribution is the design, implementation, and 
evaluation of two different parallel SDP algorithms on a 
GPU, based on the precursor mass distribution of the experi- 
mental spectrum. The precursor mass distribution describes 
the number of spectra in a group of preset consecutive mass 
windows, and marks the windows as sparse or dense. For 
the sparse mass windows, we use the GPU on-chip registers 
to minimize the memory access latency. However, due to the 
limited size of the on-chip registers, this method is not ap- 
plicable to the dense mass windows. Consequently, we de- 
sign a novel and highly efficient algorithm that treats the 
experimental and theoretical spectra as two matrices, and 
considers the scoring process as a matrix operation, and then 
makes use of the GPU on-chip shared memory together 
with the on-chip registers. Using both of the above two algo- 
rithms, we achieve a 30 to 60 times speedup compared to 
the serial version on a single CPU. 

Our second contribution is the adoption of a GPU cluster 
for protein identification that uses a novel pre-calculation 
strategy to balance the workload on each node and to de- 
crease the communication costs between the master node 
and worker nodes. We consider the operation number of 
each scoring process between the theoretical and experi- 
mental spectra as the basic task, divide the mass range into 
sub-mass ranges where the number of the basic operation 
in each sub-range is almost the same, and then dispatch 
the task to the sub-range. In the end, we obtain a favorable 
speedup on our GPU cluster that contains eight Nvidia 
GTX580 GPUs with a total of 4096 processing cores. 

Results 

All the experiments were conducted on a GPU cluster that 
included one master node (muOl) and four computing 
nodes (Fermi.1-4), as shown in Figure 1 and Table 1. All of 
the nodes had a Xeon E5620 CPU performing at 2.4 GHz, 
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Figure 1 GPU cluster architecture. A GPU cluster has one master (mu01) and four computing nodes (Fermi. 1-4). All of the nodes have a 
XeonE5620, and perform at 2.40 GHZ with two GeForce GTX580. The GTX580 has 512 cores, performs at 1.54 GHz, and has 1.54GB global 
memory with a peak bandwidth of 192.4 GB/sec. All of the nodes are connected by the GIGABIT line, and mu01 is connected to the Internet. 



and two NVidia GeForce GTX580 cards. Each GTX580 
had 512 cores, performing at 1.54 GHz and with a peak 
memory bandwidth of 192.4 GB/sec. The CPU-based pro- 
grams were developed by C++ language, and the GPU- 
based program used CUDA 4.2. 

We performed three experiments to show the speedup 
effect, using the searching parameters in Table 2. The MS/ 
MS data in Exp.l were downloaded from a previously re- 
ported dataset [34], generated from QSTAR instrument. 
This dataset was used to evaluate the performance of the 
target-decoy approach, which was one of the most classic 
and important works for the evaluation of peptide identifi- 
cation results. In Exp.2, the MS/MS data were generated by 
another liquid chromatography/tandem mass spectrometry 
(LC/MS/MS) experiment that analyzed a mixture of hu- 
man serum proteins. In Exp.3, the MS/MS data and search- 
ing parameters were the same as Exp.l, but were searched 
against UniProtKB/Swiss-Prot (2013.05.15) database. 

We mainly considered the scale of the spectra and 
protein database to test the speed, while Exp.l, Exp.2 
and Exp.3 could be considered as small, large and medium 
computing scale respectively. We also considered the 
mass distribution of the matched spectrum-peptide, which 
was a concern when we designed the speedup algorithm, 
and analyzed in the next section: SDP on a single GPU, 

SDP on a single GPU 

We first performed the experiment on X!Tandem 
(win.2011. 12.01) and pFind (V2.6.0) using the Fermi.l 

Table 1 GPU cluster specifications 



Node 



MuOl 



Fermil -4 



GPU N/A 2 x GTX580 

CPU 2 x XeonE5620(2.40GHz)/5.86GT/1 2 M/1 066 

Memory 6x4G Registered ECC 1333 MHz DDR3 

Others 1 x 1 000G 3.5inch SATA, 2 x 1 000 M Ethernet 



(Xeon E5620 CPU). Both XITandem and pFind versions 
were serial and single thread program adopting CPU only. 
The running time results were shown in Table 3. In Exp.l, 
XITandem spent 45 minutes in total, of which 26 minutes 
were for the scoring function, namely "dotQ" in the source 
code, which computed the SDP and occupied 58% of the 
total time. Similarly, pFind used 18 out of 22 minutes, 

Table 2 Database searching parameters 

Exp. 1 



Exp. 2 



Exp.3 



Instrument 
Spectra 
Database 
Enzyme 
Tolerance 
Modifications 



Instrument 

Spectra 

Database 

Enzyme 

Tolerance 

Modifications 



Instrument 

Spectra 

Database 

Enzyme 

Tolerance 

Modifications 



QSTAR 
46195 DTA files 

Yeast (13434 proteins, target + reversed) 

Trypsin (max missed cleavage sites = 2) 

Precursor: 0.2 Da; Fragment: 0.2 Da 

Fixed: Carbamidomethylation (C) 

Variable: Oxidation (M), Phosphorylation 
(S, T, Y) 

LTQ 

43493 DTA files 

IPl.Human v3.49 ( 148034 protein, 
target + reversed) 

Trypsin (max missed cleavage sites = 2) 

Precursor: 3 Da; Fragment: 0.5 Da 

Fixed: Carbamidomethylation (C) 

Variable: Oxidation (M), Phosphorylation 
(S, T, Y) 

QSTAR 

46195 DTA files 

UniprotKB/Swiss-Prot (540171 proteins) 

Trypsin (max missed cleavage sites = 2) 

Precursor: 0.2 Da; Fragment: 0.2 Da 

Fixed: Carbamidomethylation (C) 

Variable: Oxidation (M), Phosphorylation 
(S, T, Y) 
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Table 3 Time usage of database searching (minutes) 



Search engines 


Time distribution 


Exp. 1 


Exp. 2 


Exp. 3 


XITandem 


Total time 


45 


1011 


253 




Scoring time 


24 


566 


138 




Scoring time percentage 


54% 


56% 


55% 


pFind 


Total time 


22 


601 


132 




Scoring time 


18 


530 


107 




Scoring time percentage 


82% 


89% 


81% 



Note: pFind and XITandem both use a one-step mode. 



which was 82% of the total time, on the scoring function 
"ksdpQ". The time distribution in Exp.2 shared the same 
characteristics, and demonstrated the potential for in- 
creasing efficiency by parallelizing the scoring module. 
It is also worth pointing out that many optimization 
methods can be used to speed up the modules other 
than the scoring module [14,16-18]. Our work is com- 
plementary to those methods. 

We implemented single thread/process CPU SDP ver- 
sion Algorithm 1, and serially executed on the Fermi. 1. 
We also implemented single GPU SDP version Algorithm 
2 and 3, and executed on the Fermi. 1. Ignoring the time of 
reading database and spectra files for all the algorithms, the 
speedup from a single GPU (Fermi. 1) varied from thirty to 
sixty-five, as shown in Table 4. Exp.l achieved a 31 times 
speedup, Exp.2 achieved a 65 times speedup, Exp.3 achieved 
a 29 times speedup. The speedup effect resulted from 
the parallel scoring and memory access optimization. 

We implemented these two algorithms on the GPU 
with the following strategy. First, we calculated the pre- 
cursor mass distribution of the experimental spectra, 
and counted the spectra number in a consecutive group 
of 1 Da mass windows from 300 Da to 4000 Da, like 
300-301 Da, 301-302 Da, and 3999-4000 Da. 
Second, we divided the mass window into two categor- 
ies: if the number of experimental spectra was not larger 
than a preset threshold number in a mass window, then 
it was a sparse window; otherwise it was a dense window. 
Take Exp.l as an example. The threshold was set to two and 
8.3% of the mass windows are sparse. For dense windows, 
the average number of experimental spectra was twenty- 
one. Third, we adopted Algorithm 2 to handle sparse win- 
dows and used Algorithm 3 to handle dense windows. 

Algorithm 2 exploited the on-chip registers, to decrease 
the memory access latency. On GTX580, each SM has 



Table 4 Speedup effect of SDP using a single GPU 



Search engines 


Exp. 1 


Exp. 2 


Exp.3 


CPU 


968 s 


32587 s 


5529 s 


GPU 


35 s 


502 s 


191 s 


Speedup 


27 


65 


29 



Note: in Exp.1, threshold is set to 1, whereas in Exp. 2, threshold is set to 2. 



32,768 registers, and registers have 32 bits. Each theoretical 
spectrum need around 16 registers on average, in Exp.l. 
We can infer that each SM could store around 2,048 spec- 
tra, and 16 SM could deal with 32,768 experimental spec- 
tra on the register. In addition, Algorithm 2 stored the 
experimental spectra on the texture memory, which used a 
cache mechanism to decrease the memory access latency. 
Algorithm 2 also put the index file for the theoretical 
and experimental spectra mating on the constant memory 
to further decrease the reading latency. Consequently, 
Algorithm 2 read the theoretical spectra from global mem- 
ory only once; then it read experimental spectra from global 
memory also once, and read from texture less than thresh- 
old times from global memory, which was two in Exp.l and 
one in Exp.2; then it loaded theoretical spectra into the 
register, and calculated the score of these experimental 
spectra. We presented the idea in detail in Algorithm 2. 

For the experimental spectrum in the dense mass win- 
dow, Algorithm 2 will not work because there are not 
enough registers. Instead, we designed Algorithm 3 to 
adopt a shared memory that is larger than the registers; the 
reading latency is also much better than reading from the 
global memory. Algorithm 3 considered the spectrum in 
the dense mass window as a matrix, and loaded the theor- 
etical spectrum matrix, tile by tile, into the shared memory. 
Thus it accessed the global memory only once for each 
theoretical spectrum. Consequently, on average, Algorithm 
3 read both the experimental and theoretical spectra from 
global memory once. If we still use the Algorithm 2 here, 
we would read the theoretical spectrum from local mem- 
ory, for 21 times in Exp.l. on average. Besides, Algorithm 3 
also used the constant memory for the index file. 

To illustrate the utility of our mixed design strategy, we 
also compared the speedup effect of adopting Algorithm 2 
or Algorithm 3 alone. In Exp.2, Algorithm 2 spent 
8,427 seconds while Algorithm 3 spent 936 seconds. 
Both were not as efficient as the mixed strategy. Be- 
sides, Algorithm 2 performed much worse when it had 
to reading from the local and global memory multiple 
times, since the reading latency of local and global 
memory is much longer than that of the register. On 
the other hand, Algorithm 3 mainly made use of the 
shared memory, whose reading latency is small than 
local and global memory, and a little longer than the 
register when we avoided the banking conflict. In the 
Methods section, we showed how Algorithm 3 made 
the most use of the shared memory. 

SDP on the GPU cluster 

We designed two parallel SDP algorithms: one adopted 
CPUs alone, namely CPU Cluster version, while the other 
one adopted both CPUs and GPUs, namely GPU cluster 
version. We divided the parallel SDP algorithm into two 
steps: in the step 1, we assigned a sub task to one 
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computing node and prepared the database and spectra in 
each node; in the step 2, we calculated the SDP in each 
node on its own task. We designed a pre-calculation strat- 
egy for the task assignment, adopted algorithm 1 for cal- 
culating the SDP on the CPU cluster, and used Algorithm 
2 and 3 for calculating the SDP on the GPU cluster. 

In the experiment, we copied all the databases and 
spectra in each node (Fermil-4) first, calculated the sub 
task on muOl, sent messages (MPI) to each node, and 
calculated the SDP. As shown in Table 5, the speedup of 
the GPU cluster version, compared with CPU cluster 
version, varied from thirty to seventy times. The speedup 
came from both of the two steps. In the step 1 for pre- 
calculation, we got eight times speedup in Exp.l, and 
thirty times in Exp.2. In the step 2 for SDP calculation, 
we got 35 times speedup in Exp.l, and 71 times in 
Exp.2, resulting from the same reason in the previous sec- 
tion, SDP on the single GPU. The time consumption of 
step 1 was less than 10% in the CPU cluster versions, and 
the direct reason of the above speedup came from the sec- 
ond step. On the other hand, the strategy in step 1 created 
a promising overflow balance and achieved a favorable 
speedup in both the CPU- and GPU-cluster versions, 
compared to the single node version, as shown in Table 6. 

The pre-calculation strategy first calculated the oper- 
ation numbers of each scoring between the experimental- 
and theoretical- spectrum in the mass window, where any 
addition or multiplication was considered to be one oper- 
ation. The result of Exp.l was shown in Figure 2. The 
Methods section presented the detailed calculation algo- 
rithm. Based on this operation distribution, the work can 
be equally divided into N G mass ranges, where N G stood 
for the number of GPUs in our cluster. In each mass 
range, each GPU got nearly the same work, and this en- 
sured a good workload balance. The cost of our strategy 
was the calculation of operation number, which was nearly 
the same as the workflow of protein identification before 
the scoring stage. The time consumption is around 4% of 
the scoring time in the CPU version, and 6-10% in the 
GPU version. Obviously, the more GPU nodes we adopt, 
the lower cost this strategy achieves. 

Normally, the strategy for the cluster is based on the 
spectrum. The master node sends a preset number of 
spectra to each worker; if one worker has finished its 
spectra, then the worker asks for the next group from 

Table 5 Speedup effect of SDP using the GPU cluster 



the master node. However, the amount of work on each 
worker node might be significantly different. In Exp.l, 
the experimental spectrum in the precursor mass win- 
dow with 1105.5 ~ 1105.6 Da, scored with 3157 theoret- 
ical spectra, whereas the experimental spectrum with 
precursor mass 522.3 ~ 522.4 Da scored with 23 theoret- 
ical spectra. Another more careful strategy is based on 
the scoring number of each spectrum; each worker deals 
with the same number of the scoring process. However, 
different scoring process could have very different opera- 
tions. For example, assuming peaks are fully matched, 
the SDP operation number is 20 for a matched spectrum 
pair with 5 hit peaks, whereas the number is 200 for a 
matched spectrum pair with 50 hit peaks. As a result, 
the above methods do not balance the workload on each 
worker very well. The communication between the mas- 
ter and the worker is also higher than in our strategy. 

Results and discussion 

GPU-SDP does not compromise on the accuracy, and 
can be easily integrated into many search engines. Be- 
sides, it can also be very easily enhanced to support 
other similar scoring methods, such as XCorr, KSDP, 
or other probability-based methods. In the future, we 
will implement a complete GPU cluster-based search 
engine for protein identification, and the estimated ini- 
tial effect could be seen in Additional file 1. 

Conclusions 

In this study, we present a novel GPU-based scoring 
method, and design and implement a SDP-based scoring 
module on a GPU platform. We achieve an approximate 
30 to 60 times speedup on a single GPU node, relative 
to a serial CPU version, and a favorable speedup effect 
with a GPU cluster with four nodes. 

Methods 

The basic notations are as follows: T and C are the the- 
oretical spectra set and the experimental spectra set; T t 
and Q are the z-th element in Tand C, stores the m/z 
values, and are described as vector T t = [ti_ 1} t L 2>-; t LNt ] 
and Q= [c i l} c L2 , c LN X where N t and N c are the 
number of different m/z values; and t L j and c L j are the 
;-th m/z value in the MS/MS spectrum; T\ and C\ are 
also the z-th element in T and C, stores the intensity 



Search Exp. 1 Exp. 2 Exp. 3 

engines Scoring Pre-calculation Scoring Pre-calculation Scoring Pre-calculation 

CPU-cluster 273 s 14 s 8991s 242 s 1568s 94 s 

GPU-cluster 13 s 3 s 136 s 8 s 62 s 5 s 

speedup 21 5 66 30 25 19 



Note: we use one CPU and one GPU in each node of the cluster. 
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Table 6 Speedup effect of the pre calculation strategy in Exp.2 



Node 
number 



CPU-cluster 



GPU-cluster 



Scoring 



Pre-calculation 



Speedup percentage 



Scoring 



Pre-calculation 



Speedup percentage 



one 
two 
three 
four 



32587 s 
17997 s 
12153 s 
8991 s 



242 s 



87.6% 



502 s 
279 s 
178 s 
136 s 



87.4% 
89.9% 
87.1% 



Note: speedup percentage equals to: one scoring time/ (N node scoring time + pre calculation time)/N. 



values, and are described as vector T\=[t\_ 1} t' i 2 > 
t' LNt ] and C i =[c' il) c\_ 2 > c' LNc ], where f u and c' u 
are the ;-th intensity value in the MS/MS spectrum. 

The workflow of the scoring module contains three 
steps, as shown in Algorithm 1. First, line 1 and 2 per- 
form the theoretical and experimental spectrum 
matching. For each theoretical spectrum T b the algo- 
rithm will search all of the experimental spectra whose 
precursor masses are in the peptides precursor mass 
window and will get C . We adopt the spectrum hash 
indexing technology presented in our previous study 
[15] to find the matched spectrum in O(l) complexity, 
where the cost of the indexing is 0(|C|). 

Second, lines 4-7 conduct the peak matching of each 
matched theoretical and experimental spectrum pair. 
For each peak in the theoretical spectrum, the algorithm 



will search for the first matched peak in the experi- 
mental spectrum and get T\ = t\_ 2l t\_ N ] and 
C' i =[c' il , c' i 2 i c'i_n]> where N is the number of 
matched peaks, and t\j and c\j are the intensity of 
the ;-th matched peak {t\j and c\j could also be val- 
ued as 1). We again adopt the spectrum peak hash index- 
ing technology from our previous study [15] to find the 
matched peak in O(l) complexity. The cost of the index- 
ing is 0(N C ), and the complexity of the peak matching is 
0(N c + N t ). 

Third, lines 6, inside the second step, computes the 
SDP value by the matched peaks for each matched the- 
oretical and experimental spectrum pair. The SDP value 
is defined as Equation (1), where N is the number of hit 
peaks. Based on the above three steps, the whole com- 
putation complexity is 0(|C| + | T\ \ C\ ( N c + N t )). 



Algorithm 1: CPU-based SDP 

//input: the group theoretical and experimental 
// spectrum vectors 
//output: the top one score of each experimental 
// spectrum 

// C : the matched experimental spectra of the 

// theoretical spectra by precursor 

// U_ m , t'i_ m : the m/z and intensity value of the mth 

//element of a theoretical spectrum T\ 

II Cj_ n , c'j_ n : the m/z and intensity value of the n th 

//element of a experimental spectrum C 7 

1 . for each T t in T 

2. hash search C, got C 

3. for each Cy in C 

4. for each t i m in T t 

5. hash search t ijn in C 7 , got Cj_ n 

6. SDP Score += t\_ m x c)_ n 

7. end of for 

8. Max_Score = Max(SDP Score, Max_Score) 

9. end of for 

10. end of for 
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Figure 2 The number of operations in each 0.1 Da mass window, from 300 Da-4000 Da, in Exp. 1. The x-axis stands for the mass range; 
divide the mass range, from 300 Da to 4000 Da, into 36000 equal-sized 0.1 Da mass windows. The y-axis stands for the operation number between 
the experimental and theoretical spectrum in each mass window. 




SDP =< T' v C) >= ^ i=l t'ic'i (1) 



SDP on the single GPU 

In the CUDA model, the GPU is considered a coproces- 
sor that is capable of executing a large number of 
threads in parallel The GPU threads are organized into 
thread blocks, and each block of threads is executed con- 
currently on one streaming multiprocessor (SM). Each SM 
has four different types of on-chip memory, namely regis- 
ters, shared memory, constant cache, and texture cache 
[31]. Constant cache and texture cache are both read-only 
memories shared by all of the scalar processors (SPs). Off- 
chip memories, such as local memory and global memory, 
have more space but relatively long access latency, usually 
400 to 600 clock cycles [35]. The properties of the different 
types of memory are summarized in [35,36]. In general, the 
scarce registers and shared memory should be carefully used 
to amortize the global memory latency cost. 

Our first SDP algorithm on a GPU is written so that 
each thread deals with one theoretical spectrum, scoring 
with its entire matched experimental spectrum, as shown 
is Algorithm 2. The differences between Algorithms 1 and 
2 are as follows. First, Algorithm 2 unfolds the first for in 
Algorithm 1, by assigning each theoretical spectrum to a 
thread, which decreases the time consumption signifi- 
cantly as many threads are working in parallel. Second, 
Algorithm 2 merges the peak matching and SDP calcula- 
tion steps to decrease the space for the variable. 



Algorithm 2: GPU-based SDP 

1. i = thread 'Id; 

2. hash search C, got C 

3. for each C 7 in C 

4. for each t ijn in T t 

5. hash search t ijn in C 7 , got Cj_ n 

6. SDP Score += t\_ m x c'j_ n 

7. end of for 

8. end of for 



When implementing Algorithm 2, we first copy the 
theoretical spectrum to the global memory, then store 
the experimental spectrum on the texture memory and 
put the spectrum index file on the constant memory. 
We notice that when the spectrum dataset is small, 
including the total number and the spectrum length, 
we can use the on-chip register for the experimental 
spectrum and other variables. As Algorithm 2 reads a 
theoretical spectrum | C | times, where | C | stands 
for the number of theoretical spectra scoring the ex- 
perimental spectrum, reading from the register can 
significantly reduce the reading latency. We illustrate 
the effect in detail in the Results section. 

However, the problem in the implementation of 
Algorithm 2 is the limited size of the registers. In fact, 
users are not allowed to fully control the registers, and can 
only adopt registers when the data size is small enough. As 
the size and length of the spectrum grows, the data cannot 
be loaded into the registers and are instead stored in local 
memory or global memory, which increases the reading 
latency and decreases the performance significantly. 

In each mass tolerance window, a group of experimen- 
tal spectra will score with a group of theoretical spectra. 
Take Exp.l as an example. On average, 14,880 theoret- 
ical spectra will score with 21 experimental spectra in a 
one Dalton mass window with a range of 300 ~ 4000 Da. 
Thus, the theoretical and experimental spectra could be 
considered two matrixes, theo[\ T*\][N t ] and e^e[A/c][| 
C \], the result score could be denoted as Scor[\T |][| 
C |], and the score calculation process could share a 
similar flow as the matrix multiplication. Based on 
this observation, we design Algorithm 3 for a dense 
mass, using registers and shared memory together. 

As shown in Algorithm 3, each of the mass window 
matrixes theo[\T*\][N t ], expe[N c ][\C*\], and Scor[|7^|][| 
C*|] are partitioned into THx TW, TWx TH, and THx 
TW tiles, respectively, where N t and N c are the maximal 
length of the experimental and theoretical spectra. TH 
and TW are preset values, which could be the integral 
multiple number of the thread number in a half GPU 
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warp, 16, 32 or 64, to make the best use of the GPU 
warp mechanism. The resources of the GPU are parti- 
tioned as follows: the grid has (\C*\/TW) x (\f)/TH) 
blocks, the ID of which is noted by blockldx.y {by in 
Figure 3) and blockldx.x (bx in Figure 3); and each 
block has TH x TDimY threads, the ID of which is 
noted by threadldx.y (ty in Figure 3) and threadldx.x 
(tx in Figure 3). The computing task is dispatched as 
follows: each block calculates TDimY tiles in the Scot, 
which is noted as SR[TH][TWx TDimY]; then each 
thread computes a column of SR. For each thread, indexT 
points to the right position of the theoretical spectrum, 
which contains the following three parts as shown in line 
4: theo is the beginning address of the theoretical spectrum; 
as the height of the theo is divided by TH, blockldx.y x 
THxN t is the address of the corresponding block; and 
threadldx.y xN t adding threadldxx is the offset address 
inside the block. 

In line 5, indexC points to the right position in the experi- 
mental spectrum, which also has three parts: expe is the be- 
ginning address of the current spectrum; blockldx.x x TW 
points to the corresponding block address, as the width of 
the experimental spectrum is divided by TW; and threadldx. 



y x blocI<Dim.x adding threadldxx points to the address of 
the current thread inside the block. Obviously, the threads 
in one block would access the experimental spectrum 
in continuous addresses, which is also called coalesced 
accessing. indexR is calculated in the same way as in 
line 6 using the beginning address of the result, the row 
address, and the offset address inside the block for the 
current thread. 

In the loop from line 11 to 16, the algorithm loads a tile 
of data from the global memory to the shared memory, 
and computes the SDP score saved in TResult, which is 
stored on the on-chip registers; the loop ends when the 
whole row has been calculated. Line 17 waits for all of the 
threads to finish their work. Line 18 writes the distance 
back from TResult to SR. The details are shown in 
Figure 3, which takes the process of calculating a Scor 
[TH][TWx TDimY] as an example. It is equal to theo 
[TH][N t ] x expe[N c ][TW x TDimY] and the sequence 
is the following. Load the first tile (in blue) from the 
theo into the shared memory; score the blue tile in 
the theo with the blue tile in the expe, which is stored 
in the texture memory; accumulate the temporary re- 
sults in TResult, whose initial value is zero; then load 



Experimental 
Spectra 



Theoretical Spectra 



o - 



TILE_WIDTH 

TILE WIDTH 

< ► 




TILEWIDTH 
< ► 




Figure 3 The computing process in a dense mass window. The figure shows the calculation in one dense mass window. The result is a Scor 
[TH] [TW x TDimY], which is equal to theo[TH][N t ] x expe[N c ][TWx TDimY]. Load the first tile (in blue) from the theo into the shared memory; score 
the blue tile in the theo with the blue tile in the expe, which is stored in the texture memory; accumulate the temporary results into TResult, 
whose initial value is zero; then repeat loading the next tile (in orange), scoring and accumulating, until theo[TH][N t ] and expe[N c ][TWx TDimY] 
have all been accessed. 



Li et al. BMC Bioinformatics 2014, 15:121 
http://www.biomedcentral.eom/1 471 -21 05/1 5/1 21 



Page 9 of 1 1 



the next tile (in orange), and continue scoring and ac- 
cumulating until theo[TH][N t ] and expe[N c ][TWx 
TDimY] have been all accessed. 



Algorithm -3 : SDP based -on -the -shared memory - of the-GPUn 

//■TH Z TW: the height, and the width -of the tile: 
//■thread: the dimensions -of the block: 
// grid: -the dim ensi ons - o f the -gri d ; 

//■indexT'po'mts -to the -corresponding -theoretical spectrum; 

ff'indexC point s -t o the -corresponding - experim ent al - spectrum : 

// indexR point s t o -the corresponding result ; 

ffbl, rf,bD: stand -f oi'blockIdx z 'threadIdx z '3nd-blockDi?n 

// SMData: • stores the tile in shared memory; - 

ff'TResult stores -thetemp-scorein-theregisters; 

// SR : -st ores the -s core -in glob al m em ory ; 

//-Alast: the upper bound -address -of the -spectrum; 

// -CTile : the row point ed by indexT the length i s TW- x • TD im Y: 

//The -theoretical spectrum is on the texture; 

\.->TH*-16 7 1W<- 16 7 TDimY^2, 

2.-* dim lhread{TH z TDimY), 

3 dim grid (JW-TW, n/TH); 

4.-» indexT = expe + bI.yxm*Nt+tI.y *JVr+ tl.x. 

5 indexC = theo + blx * ZTF-+ tl, bD.x+ tf.x. 

6.^indexR = Scor -+-bly xTH*b+ blx* TW+ tIy* bD.x + tl.x. 

7 SMData\TW] [TH] in shared memory; 

8 TResult\TW\ in Tegisters ; 

9.-* Alast indexT-tr -Nt, 

10*do 

1L*{ 

1 2 . + — L oad tiata from -global memory to SMData; 

13 . -» — indexT is added by TW\ 

14. -* - for each - co/; ; ; ; ; ; - e lime ; ^^consumptione ; ch bio ^haying - 
32, -64 and 1 2 8 iJ^eads through our ^pa^ents. - 

n^'-SM)ataf^ 
15 + 77?eraA[f| ^=peakMatch&dot(SMdata[i] z CTile) 

16. -*} while (indexT< AIast)i 

17. * syncthreadsO; 

1 8. AVritet ZRasttftbackto Scor: 



The main purpose of Algorithm 3 is to decrease the 
global memory access time and latency by loading the 
theoretical spectrum into the shared memory, tile by tile. 
Thus, Algorithm 3 reads each theoretical spectrum from 



global memory only once, the same as Algorithm 2. The 
key feature of Algorithm 3 is how efficiently it accesses 
the global memory and shared memory; this is achieved 
by adopting coalescing reading that accesses sixteen 
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continuous addresses for the threads in a half warp to 
avoid the bank conflict. 



SDP on the GPU cluster 

On the GPU cluster, as each node could adopt 
Algorithm 2 and 3, the main concern is how to dispatch 
the work to each node and achieve a high workload bal- 
ance. We design a new complete pre-calculation strategy 
to make each node work on nearly the same task. In this 
strategy, we first run the workflow of protein identifica- 
tion before the scoring stage to get experimental and 
theoretical matching results. These results tell us how 
many peptides each spectrum will score with, as shown 
in Algorithm 4, line 1-4. 

Second, in line 5 of Algorithm 4, we calculate the op- 
eration number for each SDP scoring in the following 
way: d e + d t + hitcount x 2, where d e and d t are the di- 
mension of the experimental and theoretical spectra, 
and hitcount is the number of hit peaks. d e + d t stands 
for the peak matching step in Algorithm 2, and hit- 
count x 2 stands for the dot product step in Algorithm 
2. As a result, we get the operation number for each 
mass range, such as one Dalton, from 300 Dalton to 
4000 Dalton, based on the experimental and theoret- 
ical precursor mass, matching results, and each SDP 
operation. 



Algorithm 4: GPU-based pre-calculation strategy 

IIMin: the minimum mass, set to 300 Da; 

//Max:the maximum mass, set to 4000 Da; 

//Win: the range of each mass window, set to 1 Da; 

l/Oper [{Max-Min)IWin\. the array stores the operation 

//number of in each mass window; 

1. i = thread Id; 

2. hash search C, got C 

3. for each C 7 in C 

4. hash search t m in C 7 , got HitCou 

5. Oper[Cj.mass-Min\+=Ti.len+Cj.len+2xHitCou ', 

6. end of for 



Third, we dispatch the task by mass range and give 
each node the same amount of work. As shown in 
Algorithm 5, line 1-3 calculate the total number of op- 
erations; line 4 gets the average number of operations 
on each GPU node; line 5-10 travers the Oper array; 
when the temporal summary of operation exceeds the 
average number WorkerOper, Algorithm 5 call Algorithm 
2 or 3, to deal with the current mass range, which is 
Oper[p] to Oper[k]. 

The overhead of the pre-calculation strategy is also 
very low, as shown in the Results section. After the 



calculation, the master node only transfers data to the 
computing node once, and this lowers the communica- 
tion cost. 

Algorithm 5: CPU-based Master strategy 

1 1 sum: the total number of operations in the scoring process; 
IIWorkerNum: the number of nodes in the GPU cluster; 
1 1 WorkerOper: the number of operations in each node; 
//sum ': the temp total number of each mass range; 
llj: the current ID of the GPU node; 
lip: the lower bound of mass range in Oper array; 
Ilk: the higher bound of mass range in Oper array; 
HCallWorker: call the node, work with Oper\p] to Oper[k] 

1 . for each i in Oper 

2. sum += Oper[i] ', 

3. end of for 

4 . WorkerOper = sum/ WorkerNum ; 

5. for each k in Oper 

6. sum ' += Oper[k] 

7. if sum' > WorkerOper 

8. CallWorker(j++, Oper\p], Oper[k]); 

9. sum ' = 0, p = k; 

10. end of if 

11. end of for 



Additional file 



Additional file 1: Initial accelerating effect of the peptide 
identification search engine using GPUs. 
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